Loading required packages, source files and data.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(highcharter)
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(ggplot2)
source('../unbalanced_functions.R')
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## Loading required package: lattice
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
data <- read_csv('../data/murder_suicide.csv')
## Parsed with column specification:
## cols(
## .default = col_integer(),
## Sex = col_character(),
## MaritalStatus = col_character(),
## InjuryAtWork = col_character(),
## MethodOfDisposition = col_character(),
## Autopsy = col_character(),
## Icd10Code = col_character()
## )
## See spec(...) for full column specifications.
Answer the following questions about mortality in the United States.
1. Pick a non-redundant subset of the mortality variables. Compute and plot the correlation matrix of these variables. In addition, draw the pairwise scatter plots of the selected variables.
In order to find a non-redundant set of attributes, we started by analysing them semantically and removing the derived ones, the ones with only a single value, and the ones in which one value was overwhelmingly dominant.
We've also removed meaningless, artificially added attributes such as Id.
The result is called non_redundant_colnames.
In order to shrink the set further, we've omitted attributes with chaotic behaviour, such as MonthOfDeath and DayOfWeekOfDeath.
We've also dropped attributes with very little absolute correlation to the others, such as HispanicOriginRaceRecode.
The resulting set is called minified_non_redundant_colnames.
We then drew three different correlation plots for the resulting data; the third one also has a built-in confidence-level analysis.
Finally, we've drawn the pairwise scatter plots of the 14 remaining attributes.
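The redundancy check behind this pruning can be sketched on toy data (hypothetical variables, not the mortality columns): a pairwise-complete correlation matrix exposes a near-duplicate attribute even in the presence of missing values.

```r
# Toy sketch (hypothetical variables, not the mortality data):
# a pairwise-complete correlation matrix exposes a redundant attribute.
set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy$c <- toy$a + rnorm(200, sd = 0.1)   # near-duplicate of `a`
toy$b[1:10] <- NA                       # some missing values
cm <- cor(toy, use = "pairwise.complete.obs")
cm["a", "c"]                            # close to 1, so `c` is redundant given `a`
```

The `use = "pairwise.complete.obs"` argument is the same one used on the real data below; it computes each entry from the rows where both variables are observed.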
###############################################################################################
## Q1
data %>% colnames -> colnames
data %>% filter(AgeType == 1) -> data
colnames %>% setdiff(c('EducationReportingFlag', 'AgeType', 'AgeRecode12',
'AgeRecode27', 'AgeRecode52', 'InfantAgeRecode22',
'RaceImputationFlag', 'RaceRecode3', 'RaceRecode5',
'BridgedRaceFlag', 'CurrentDataYear', 'AgeSubstitutionFlag',
'CauseRecode358', 'CauseRecode113', 'InfantCauseRecode130',
'HispanicOrigin', 'Id')) -> non_redundant_colnames
non_redundant_colnames %>% setdiff(c('HispanicOriginRaceRecode', 'NumberOfEntityAxisConditions',
'NumberOfRecordAxisConditions', 'DayOfWeekOfDeath', 'Race',
'MonthOfDeath', 'PlaceOfDeathAndDecedentsStatus')) -> minified_non_redundant_colnames
data %>% select(non_redundant_colnames) %>%
mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) -> q1_nr_data
data %>% select(minified_non_redundant_colnames) %>%
mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) -> minified_q1_nr_data
minified_q1_nr_data %>% cor(use = 'pairwise.complete.obs') -> q1_cor
q1_cor %>% hchart()
library(corrplot)
## corrplot 0.84 loaded
minified_q1_nr_data %>% cor.mtest(conf.level = .95) -> res1
q1_cor %>% corrplot(type = 'upper')
q1_cor %>% corrplot(type = 'upper', p.mat = res1$p, sig.level = 1e-90, pch.cex = 1.5)
minified_q1_nr_data %>% slice(1:500) %>% plot()
2. Evaluate the effect of each of the variables sex, race, education, age, and method of disposition on murder or suicide.
In order to test the null hypothesis that these variables have no effect on the manner of death, we've used the Kruskal-Wallis rank sum test (a non-parametric test whose statistic is chi-squared distributed).
In order to better understand how these variables relate to the manner of death, we've visualized the relations using appropriate diagrams.
The results are somewhat interesting!
Firstly, the null hypothesis is confidently rejected for all of the Sex, Race, Education, Age and MethodOfDisposition variables.
Here are some interesting statements:
- Suicide is relatively more common among women than men, while men are subject to relatively more murders than women. Overall, suicide is much more common than murder.
- We know that the religious beliefs of Southwest Asians do not consider suicide problematic; the diagram supports this point with a high ratio of suicide relative to murder.
- The murder rate is lower than the suicide rate in every race except for Black people! One possible cause is racist violence against Black people. The high murder ratio among Native Americans is also notable.
- The rate of suicide increases with the level of education!
- As age increases, the rate of suicide rises too. There is also a high rate of suicide among teenagers going through puberty.
- The rate of suicide is high among cultures that cremate their deceased, and we know that cremation is common in Central and East Asian countries.
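The test applied to each variable below can be illustrated on synthetic data (hypothetical groups, not the mortality records): when one group's distribution is shifted, the Kruskal-Wallis p-value becomes small and the null hypothesis is rejected.

```r
# Synthetic sketch of the Kruskal-Wallis test applied below
set.seed(2)
g <- factor(rep(c("A", "B", "C"), each = 60))
x <- c(rnorm(60, mean = 0), rnorm(60, mean = 0), rnorm(60, mean = 1.5))
kt <- kruskal.test(x ~ g)
kt$p.value < 0.05   # group C shifts the distribution, so H0 is rejected
```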
###############################################################################################
## Q2
# cbind(
# data %>% select(MethodOfDisposition),
# data %>% mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) %>% select(MethodOfDisposition)
# ) %>% View()
q1_nr_data %>% mutate(ms = ifelse(MannerOfDeath %in% c(2, 6), 'Suicide',
ifelse(MannerOfDeath == 3, 'Murder', 'Other'))) -> q2_data
############################
male_count <- q2_data %>% filter(Sex == 2) %>% count() %>% as.integer()
female_count <- q2_data %>% filter(Sex != 2) %>% count() %>% as.integer()
q2_data %>% group_by(Sex, ms) %>% summarise(count = n()) %>% ungroup() %>%
mutate(Sex = ifelse(Sex == 2, 'Male', 'Female')) %>%
mutate(ratio = ifelse(Sex == 'Male', count/male_count, count/female_count)) %>%
hchart(type = 'column', hcaes(x = ms, y = ratio, group = Sex))
kruskal.test(ms ~ Sex, data = q2_data) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by Sex
## Kruskal-Wallis chi-squared = 33.535, df = 1, p-value = 7.001e-09
############################
race_counts <- q2_data %>% group_by(Race) %>% summarise(race_count = n())
race_names <- read_csv('../data/Race.csv')
## Parsed with column specification:
## cols(
## Code = col_integer(),
## Description = col_character()
## )
full_join(
race_counts,
race_names %>% rename(Race = Code),
by = 'Race'
) %>% full_join(
q2_data,
by = 'Race'
) %>% group_by(Race, ms, race_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
mutate(ratio = count/race_count) %>% .[complete.cases(.),] %>%
hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Race, data = q2_data) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by Race
## Kruskal-Wallis chi-squared = 15366, df = 13, p-value < 2.2e-16
############################
q2_data_1989 <- q2_data %>% filter(Education1989Revision != 0)
q2_data_2003 <- q2_data %>% filter(Education2003Revision != 0)
edu_1989_names <- read_csv('../data/Education1989Revision.csv')
## Parsed with column specification:
## cols(
## Code = col_integer(),
## Description = col_character()
## )
edu_1989_names %>% mutate(Description = ifelse(Description == 'Years of elementary school', paste(Code, Description, sep = ' '), Description)) -> edu_1989_names
full_join(
q2_data %>% group_by(Education1989Revision) %>% summarise(edu_count = n()) %>% filter(Education1989Revision != 0),
edu_1989_names %>% rename(Education1989Revision = Code),
by = 'Education1989Revision'
) %>% full_join(
q2_data_1989,
by = 'Education1989Revision'
) %>% group_by(Education1989Revision, ms, edu_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
mutate(ratio = count/edu_count) %>% .[complete.cases(.),] %>%
hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Education1989Revision, data = q2_data_1989) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by Education1989Revision
## Kruskal-Wallis chi-squared = 292.21, df = 17, p-value < 2.2e-16
############################
edu_2003_names <- read_csv('../data/Education2003Revision.csv')
## Parsed with column specification:
## cols(
## Code = col_integer(),
## Description = col_character()
## )
full_join(
q2_data %>% group_by(Education2003Revision) %>% summarise(edu_count = n()) %>% filter(Education2003Revision != 0),
edu_2003_names %>% rename(Education2003Revision = Code),
by = 'Education2003Revision'
) %>% full_join(
q2_data_2003,
by = 'Education2003Revision'
) %>% group_by(Education2003Revision, ms, edu_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
mutate(ratio = count/edu_count) %>% .[complete.cases(.),] %>%
hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Education2003Revision, data = q2_data_2003) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by Education2003Revision
## Kruskal-Wallis chi-squared = 3315.7, df = 8, p-value < 2.2e-16
############################
full_join(
q2_data %>% filter(Age < 250) %>% group_by(Age) %>% summarise(age_count = n()),
q2_data,
by = 'Age'
) %>%
group_by(Age, ms, age_count) %>% summarise(count = n()) %>% ungroup() %>%
mutate(ratio = count/age_count) %>% .[complete.cases(.),] %>%
hchart(type = 'spline', hcaes(x = Age, y = ratio, group = ms))
q2_data %>% filter(Age < 250) %>%
  ggplot(aes(y = Age, x = ms)) + geom_boxplot(notch=TRUE, outlier.colour="red", outlier.shape=8)
kruskal.test(ms ~ Age, data = q2_data) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by Age
## Kruskal-Wallis chi-squared = 6313.6, df = 103, p-value < 2.2e-16
############################
disposition_names <- read_csv('../data/MethodOfDisposition-custom.csv')
## Parsed with column specification:
## cols(
## Code = col_character(),
## Description = col_character()
## )
disposition_count <- q2_data %>% group_by(MethodOfDisposition) %>% summarise(dispo_count = n())
cbind(
disposition_names,
disposition_count
) %>% full_join(
q2_data,
by = 'MethodOfDisposition'
) %>% group_by(ms, Description, dispo_count) %>% summarise(count = n()) %>% ungroup() %>%
mutate(ratio = count/dispo_count) -> q2_dispo_data
q2_dispo_data %>% hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
q2_dispo_data %>% hchart(type = 'column', hcaes(x = Description, y = count, group = ms))
kruskal.test(ms ~ MethodOfDisposition, data = q2_data) %>% print
##
## Kruskal-Wallis rank sum test
##
## data: ms by MethodOfDisposition
## Kruskal-Wallis chi-squared = 4407.9, df = 6, p-value < 2.2e-16
3. Fit a model to the data using logistic regression, then diagnose it.
First, we mutate the MannerOfDeath attribute into a binary indicator of murder versus suicide.
Then we fit our logistic regression model on the mutated minified data from Q1.
The model diagnostics suggest that the Education1989Revision and InjuryAtWork attributes may have no effect on MannerOfDeath.
Using this diagnosis, we improve the model by removing these variables from it.
The analysis of the final model can be found below.
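The fit-then-prune workflow can be sketched on synthetic data (hypothetical predictors `x1` and `noise`, not the mortality columns): a variable with no real effect typically gets a large coefficient p-value and is dropped from the refit.

```r
# Synthetic sketch of the fit-then-prune workflow (hypothetical predictors)
set.seed(3)
n <- 1000
x1 <- rnorm(n)
noise <- rnorm(n)                               # has no real effect on y
y <- rbinom(n, 1, plogis(1.5 * x1))
m1 <- glm(y ~ x1 + noise, family = binomial(link = 'logit'))
summary(m1)$coefficients[, "Pr(>|z|)"]          # `noise` typically has a large p-value
m2 <- glm(y ~ x1, family = binomial(link = 'logit'))  # pruned model
```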
###############################################################################################
## Q3
minified_q1_nr_data %>% mutate(MannerOfDeath = MannerOfDeath %in% c(2, 6)) -> q3_data
q3_data %>% filter(Age < 250) -> q3_data
glm(MannerOfDeath~., data = q3_data, family = binomial(link = 'logit')) -> q3_logit_model_1
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q3_logit_model_1 %>% summary.glm() %>% print
##
## Call:
## glm(formula = MannerOfDeath ~ ., family = binomial(link = "logit"),
## data = q3_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.0712 -0.0702 0.0885 0.2629 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 45.637315 1.399101 32.619 < 2e-16 ***
## ResidentStatus 0.109321 0.031126 3.512 0.000444 ***
## Education1989Revision 0.008698 0.005362 1.622 0.104804
## Education2003Revision 0.180788 0.011872 15.228 < 2e-16 ***
## Sex 0.354360 0.044982 7.878 3.33e-15 ***
## Age 0.017591 0.001111 15.829 < 2e-16 ***
## MaritalStatus -0.153787 0.020146 -7.634 2.28e-14 ***
## InjuryAtWork -0.071879 0.055300 -1.300 0.193674
## MethodOfDisposition 0.119863 0.014074 8.517 < 2e-16 ***
## Autopsy -1.400189 0.042340 -33.070 < 2e-16 ***
## ActivityCode -0.284778 0.004746 -60.000 < 2e-16 ***
## PlaceOfInjury -0.168226 0.002653 -63.414 < 2e-16 ***
## Icd10Code -0.239010 0.003040 -78.620 < 2e-16 ***
## CauseRecode39 0.330360 0.044565 7.413 1.24e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 70507 on 59695 degrees of freedom
## Residual deviance: 18947 on 59682 degrees of freedom
## AIC: 18975
##
## Number of Fisher Scoring iterations: 14
glm(MannerOfDeath~.-InjuryAtWork-Education1989Revision, data = q3_data, binomial(link = 'logit')) -> q3_logit_model_2
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q3_logit_model_2 %>% summary.glm() %>% print
##
## Call:
## glm(formula = MannerOfDeath ~ . - InjuryAtWork - Education1989Revision,
## family = binomial(link = "logit"), data = q3_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.0807 -0.0702 0.0889 0.2632 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 45.522912 1.395182 32.629 < 2e-16 ***
## ResidentStatus 0.107518 0.031107 3.456 0.000548 ***
## Education2003Revision 0.176848 0.011546 15.317 < 2e-16 ***
## Sex 0.352952 0.044970 7.849 4.20e-15 ***
## Age 0.017624 0.001110 15.882 < 2e-16 ***
## MaritalStatus -0.152723 0.020145 -7.581 3.43e-14 ***
## MethodOfDisposition 0.131316 0.011303 11.617 < 2e-16 ***
## Autopsy -1.395587 0.042101 -33.149 < 2e-16 ***
## ActivityCode -0.284842 0.004748 -59.995 < 2e-16 ***
## PlaceOfInjury -0.168442 0.002647 -63.634 < 2e-16 ***
## Icd10Code -0.239098 0.003038 -78.702 < 2e-16 ***
## CauseRecode39 0.331384 0.044532 7.442 9.95e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 70507 on 59695 degrees of freedom
## Residual deviance: 18945 on 59684 degrees of freedom
## AIC: 18969
##
## Number of Fisher Scoring iterations: 13
4. Using three diagrams, evaluate the model output against the real data.
We've evaluated the model's output against the real data using four diagrams.
The results follow:
###############################################################################################
## Q4
q3_data %>% mutate(prediction = fitted(q3_logit_model_2)) %>%
ggplot(aes(x = Age, y = prediction, color = MannerOfDeath)) + geom_point()
q3_data %>% mutate(prediction = predict(q3_logit_model_2, type = 'response')) %>%
mutate(MannerOfDeath = as.integer(MannerOfDeath)) %>%
ggplot(aes(x = Age, y = MannerOfDeath)) +
geom_point(aes(x = Age, y = prediction), color = 'red', alpha = 0.2) +
geom_point(alpha = 0.005, size = 3)
library(ggthemes)
q3_data %>% mutate(prediction = predict(q3_logit_model_2, type = 'response')) %>%
ggplot(aes(fitted(q3_logit_model_2), color = as.factor(MannerOfDeath))) +
geom_density(size = 0.5) +
ggtitle("Training Set's Predicted Score") +
scale_color_economist(name = "data", labels = c("negative", "positive"))
table(q3_data$MannerOfDeath, ifelse(fitted(q3_logit_model_2)>0.3,1,0)) %>% plot()
5. First, randomly set aside 20% of the data as a test set. Fit the model on the remaining 80%. Predict the test data using a cutoff parameter of 0.5, then compute the following quantities:
- P: positive samples
- N: negative samples
- TP: true positive (eqv. with hit)
- TN: true negative (eqv. with correct rejection)
- FP: false positive (eqv. with false alarm, Type I error)
- FN: false negative (eqv. with miss, Type II error)
- Accuracy (ACC): ACC = (TP+TN)/(P+N)
- False positive rate (FPR): 1 - TN/N
- True positive rate (TPR): TP/P
As shown in class, draw a representation of the four quantities TP, TN, FP, FN together with the data.
First, the data is randomly divided into a 20% test set and an 80% training set. The model is then trained on the training set, and the requested statistics are calculated for it.
Next, a mosaic plot is drawn to show the model accuracy and the Type I and II errors.
Then the confusion matrix plot is drawn; it is based on much the same idea as the mosaic plot.
After that, the training/test accuracy is plotted against different cutoffs at two levels of magnification.
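The requested quantities reduce to a few sums over the predicted labels. A minimal sketch on hypothetical vectors (not the fitted model), using the textbook convention that P and N count actual positives and negatives:

```r
# Minimal sketch of the requested quantities at cutoff 0.5
confusion_stats <- function(prob, actual, cutoff = 0.5) {
  pred <- prob > cutoff
  TP <- sum(pred & actual);   FP <- sum(pred & !actual)
  TN <- sum(!pred & !actual); FN <- sum(!pred & actual)
  P <- TP + FN; N <- TN + FP          # actual positives / negatives
  c(ACC = (TP + TN) / (P + N), FPR = 1 - TN / N, TPR = TP / P)
}
confusion_stats(prob   = c(0.9, 0.8, 0.4, 0.2, 0.7),
                actual = c(TRUE, TRUE, TRUE, FALSE, FALSE))
# ACC = 0.6, FPR = 0.5, TPR = 2/3
```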
###############################################################################################
## Q5
q3_data %>% nrow -> n_data
q5_sample <- sample(1:n_data, size = 0.8*n_data, replace = FALSE)
q3_data %>% mutate(MannerOfDeath = as.integer(MannerOfDeath)) %>% select(-Education1989Revision,-InjuryAtWork) -> q5_data
q4_model_data <- q5_data %>% .[q5_sample,]
q4_test_data <- q5_data %>% .[-q5_sample,]
glm(MannerOfDeath~., data = q4_model_data, family = 'binomial') -> q4_logit_model
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q4_test_data$prediction <- predict.glm(q4_logit_model, newdata = q4_test_data , type = "response")
q4_model_data$prediction <- predict.glm(q4_logit_model, newdata = q4_model_data , type = "response")
q4_logit_model %>% summary.glm()
##
## Call:
## glm(formula = MannerOfDeath ~ ., family = "binomial", data = q4_model_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.49 0.00 0.00 0.00 8.49
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.725e+15 1.447e+07 1.192e+08 <2e-16 ***
## ResidentStatus 1.669e+14 5.730e+05 2.913e+08 <2e-16 ***
## Education2003Revision 4.180e+13 1.827e+05 2.288e+08 <2e-16 ***
## Sex -3.716e+12 7.470e+05 -4.974e+06 <2e-16 ***
## Age 1.113e+12 1.737e+04 6.409e+07 <2e-16 ***
## MaritalStatus -2.187e+13 3.134e+05 -6.978e+07 <2e-16 ***
## MethodOfDisposition 3.094e+13 1.957e+05 1.582e+08 <2e-16 ***
## Autopsy -4.483e+14 3.453e+05 -1.298e+09 <2e-16 ***
## ActivityCode -1.615e+13 4.048e+04 -3.990e+08 <2e-16 ***
## PlaceOfInjury -2.071e+13 3.234e+04 -6.404e+08 <2e-16 ***
## Icd10Code -2.181e+13 3.085e+04 -7.069e+08 <2e-16 ***
## CauseRecode39 1.090e+14 4.642e+05 2.347e+08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 56257 on 47755 degrees of freedom
## Residual deviance: 237456 on 47744 degrees of freedom
## AIC: 237480
##
## Number of Fisher Scoring iterations: 25
# suicide is TRUE, murder is FALSE
q4_test_data %>% mutate(prediction = predict.glm(q4_logit_model,q4_test_data ,type = 'response')) %>%
select(prediction, MannerOfDeath) %>% mutate(prediction_end = prediction > 0.5) -> q4_calc_data
q4_calc_data %>% summarise(P = sum(prediction_end),
N = n() - sum(prediction_end),
TP = sum(prediction_end & MannerOfDeath),
TN = sum(!prediction_end & !MannerOfDeath),
FP = sum(prediction_end & !MannerOfDeath),
FN = sum(!prediction_end & MannerOfDeath)) %>%
mutate(ACC = (TP+TN)/(P+N),
FPR = 1- (TN/N),
TPR = TP/P) %>% print
## # A tibble: 1 x 9
## P N TP TN FP FN ACC FPR TPR
## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 9222 2718 8452 2620 770 98 0.927 0.0361 0.917
table(q4_test_data$MannerOfDeath, ifelse(predict.glm(q4_logit_model, q4_test_data, type = 'response')>0.5,1,0)) %>%
plot()
cm_info <- ConfusionMatrixInfo(data = q4_test_data, predict = "prediction", actual = "MannerOfDeath", cutoff = .5)
cm_info$plot
## Warning: Removed 5864 rows containing missing values (geom_point).
accuracy_info <- AccuracyCutoffInfo(train = q4_model_data, test = q4_test_data, predict = "prediction", actual = "MannerOfDeath")
accuracy_info$plot
calculate_acc <- function(data, cutoff) {
P <- sum(data$prediction > cutoff)
N <- sum(data$prediction <= cutoff)
TP <- sum((data$prediction > cutoff) & (data$MannerOfDeath == 1))
TN <- sum(!(data$prediction > cutoff) & !(data$MannerOfDeath == 1))
ACC <- (TP+TN)/(P + N)
return(ACC)
}
draw_acc_diag <- function(gaps) {
sapply(seq(0, 1, gaps), function(x) {calculate_acc(q4_test_data, x)}) -> all_test_acc_vals
sapply(seq(0, 1, gaps), function(x) {calculate_acc(q4_model_data, x)}) -> all_model_acc_vals
rbind(
cbind(
seq(0, 1, gaps) %>% as.data.frame %>% rename(cutoff = '.'),
accuracy = all_test_acc_vals,
data = 'test'
)
,
cbind(
seq(0, 1, gaps) %>% as.data.frame %>% rename(cutoff = '.'),
accuracy = all_model_acc_vals,
data = 'train'
)
) %>% hchart(type = 'line', hcaes(x = cutoff, y = accuracy, group = data))
}
draw_acc_diag(0.005)
6. Plot the model accuracy against different cutoff values for the test data. Which cutoff parameter yields the highest prediction accuracy?
That diagram was the last one of the previous question! :D
Based on the diagram, the best cutoff value is about 0.443.
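Reading the best value off the curve amounts to an argmax over a cutoff grid. A hypothetical helper (not part of the scripts above) illustrates the idea on a perfectly separable toy case:

```r
# Hypothetical helper: the cutoff with the highest accuracy on a grid
best_cutoff <- function(prob, actual, cutoffs = seq(0, 1, by = 0.005)) {
  acc <- sapply(cutoffs, function(cc) mean((prob > cc) == actual))
  cutoffs[which.max(acc)]   # first grid value reaching the maximum accuracy
}
best_cutoff(prob   = c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9),
            actual = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))
# 0.3: the first grid value that separates the two groups perfectly
```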
###############################################################################################
## Q6
# draw_acc_diag(0.0005)
7. Draw the ROC curve for the data of the previous part, and mark the point corresponding to the best cutoff parameter.
The ROC diagram is drawn and can be found below.
Based on this diagram, the best cutoff value is again about 0.443.
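The ROC curve itself is just the (FPR, TPR) pairs traced out as the cutoff sweeps over the score range. A base-R sketch on synthetic scores (independent of the ROCInfo helper used below):

```r
# Base-R sketch of an ROC curve on synthetic scores
set.seed(5)
actual <- rbinom(500, 1, 0.5)
score  <- actual + rnorm(500)            # informative but noisy score
cuts   <- seq(min(score), max(score), length.out = 100)
roc <- t(sapply(cuts, function(cc) {
  pred <- score > cc
  c(FPR = sum(pred & actual == 0) / sum(actual == 0),
    TPR = sum(pred & actual == 1) / sum(actual == 1))
}))
plot(roc[, "FPR"], roc[, "TPR"], type = 'l', xlab = 'FPR', ylab = 'TPR')
```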
###############################################################################################
## Q7
cost_fp = 100
cost_fn = 200
roc_info = ROCInfo(data = cm_info$data, predict = "predict",
actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn)
grid.draw(roc_info$plot)
8. Setting nfolds = 5 and using H2O, build the model for the problem and evaluate the result.
In order to use the h2o package for fitting the generalized linear model, we first reformatted the data to be compatible with the package.
Then we created the model using h2o.glm.
The model's statistics show very high accuracy!
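What nfolds = 5 asks h2o to do can be sketched by hand in base R (a hand-rolled sketch on synthetic data, not the h2o implementation): fit on four folds, score the held-out fold, and average.

```r
# Hand-rolled sketch of 5-fold cross-validation for a binomial GLM
set.seed(8)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))
fold <- sample(rep(1:5, length.out = n))   # random fold assignment
acc <- sapply(1:5, function(k) {
  fit  <- glm(y ~ x, family = binomial, subset = fold != k)
  prob <- predict(fit, newdata = data.frame(x = x[fold == k]), type = 'response')
  mean((prob > 0.5) == y[fold == k])       # held-out accuracy of fold k
})
mean(acc)                                  # averaged over the five folds
```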
###############################################################################################
## Q8
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
##
## hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 51 minutes
## H2O cluster timezone: Asia/Tehran
## H2O data parsing timezone: UTC
## H2O cluster version: 3.19.0.4257
## H2O cluster version age: 19 days
## H2O cluster name: H2O_started_from_R_mohammadmahdi_rem508
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.64 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.4 (2018-03-15)
q8_data <- as.h2o(q5_data)
h2o_model <- h2o.glm(y = "MannerOfDeath", x= colnames(q8_data),
training_frame = q8_data, family="binomial", nfolds = 5)
## Warning in .verify_dataxy(training_frame, x, y): removing response variable
## from the explanatory variables
h2o_model %>% print
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model ID: GLM_model_R_1524826888756_55
## GLM Model: summary
## family link regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 5.018E-4 )
## number_of_predictors_total number_of_active_predictors
## 1 11 11
## number_of_iterations training_frame
## 1 6 q5_data
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 50.357145 1.708069
## 2 ResidentStatus 0.080617 0.043723
## 3 Education2003Revision 0.167756 0.324389
## 4 Sex 0.317856 0.131790
## 5 Age 0.017433 0.326704
## 6 MaritalStatus -0.145887 -0.145127
## 7 MethodOfDisposition 0.125112 0.221123
## 8 Autopsy -1.326282 -1.226362
## 9 ActivityCode -0.262539 -3.221947
## 10 PlaceOfInjury -0.156193 -2.280730
## 11 Icd10Code -0.216646 -3.651898
## 12 CauseRecode39 0.080348 0.094597
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.02031003
## RMSE: 0.1425133
## LogLoss: 0.1629414
## Mean Per-Class Error: 0.02279234
## AUC: 0.986859
## Gini: 0.973718
## R^2: 0.8986938
## Residual Deviance: 21305.49
## AIC: 21329.49
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 15918 646 0.039000 =646/16564
## 1 284 42848 0.006584 =284/43132
## Totals 16202 43494 0.015579 =930/59696
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.447941 0.989264 216
## 2 max f2 0.378976 0.992744 232
## 3 max f0point5 0.620604 0.989095 182
## 4 max accuracy 0.461124 0.984421 213
## 5 max precision 0.963038 0.993562 50
## 6 max recall 0.000041 1.000000 399
## 7 max specificity 0.999686 0.994446 0
## 8 max absolute_mcc 0.461124 0.961001 213
## 9 max min_per_class_accuracy 0.679682 0.977300 170
## 10 max mean_per_class_accuracy 0.555405 0.979638 196
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.02066579
## RMSE: 0.143756
## LogLoss: 0.1660724
## Mean Per-Class Error: 0.02252939
## AUC: 0.9866897
## Gini: 0.9733793
## R^2: 0.8969193
## Residual Deviance: 21676.75
## AIC: 21700.75
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 15939 625 0.037732 =625/16564
## 1 316 42816 0.007326 =316/43132
## Totals 16255 43441 0.015763 =941/59696
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.460862 0.989131 221
## 2 max f2 0.383037 0.992572 238
## 3 max f0point5 0.614252 0.988985 189
## 4 max accuracy 0.460862 0.984237 221
## 5 max precision 0.939405 0.993349 66
## 6 max recall 0.000018 1.000000 399
## 7 max specificity 0.999641 0.994868 0
## 8 max absolute_mcc 0.460862 0.960540 221
## 9 max min_per_class_accuracy 0.678087 0.976878 175
## 10 max mean_per_class_accuracy 0.573033 0.979460 197
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid
## accuracy 0.98458874 5.956392E-4 0.98386693 0.98522586 0.9850883
## auc 0.9866854 6.315878E-4 0.98773897 0.98779094 0.98619944
## err 0.015411257 5.956392E-4 0.016133077 0.014774166 0.014911696
## err_count 184.0 7.1554174 193.0 175.0 179.0
## f0point5 0.9877865 4.063343E-4 0.9879097 0.9877563 0.988348
## cv_4_valid cv_5_valid
## accuracy 0.98545027 0.98331237
## auc 0.9859518 0.98574597
## err 0.014549712 0.016687632
## err_count 174.0 199.0
## f0point5 0.98820263 0.98671573
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## precision 0.9867373 4.8671348E-4 0.9872102 0.9864017 0.98746115
## r2 0.8969126 0.0013266645 0.8964911 0.8991272 0.89815253
## recall 0.9920066 6.4188975E-4 0.9907174 0.9932124 0.99191123
## residual_deviance 4335.3506 230.07188 4615.1147 4451.337 4707.6543
## rmse 0.14374892 9.606604E-4 0.14293139 0.14238524 0.14314623
## specificity 0.96528 0.001240069 0.96540004 0.9645454 0.96746266
## cv_4_valid cv_5_valid
## precision 0.987061 0.9855525
## r2 0.8971822 0.8936099
## recall 0.9927957 0.99139637
## residual_deviance 3906.3455 3996.3013
## rmse 0.14403123 0.14625049
## specificity 0.9665971 0.9623947
9. Can we offer a service to judges so that, using death records, they can separate suspected murders from suicides?
The model we've developed in this homework has very high accuracy and a very low error rate.
The high R^2 value indicates that the model describes the data distribution well.
The accuracy-versus-cutoff diagram also suggests that we have not overfitted in training this model.
Based on this information, I think this model is powerful enough to be used even in courts of law.